2022-11-03

The Basic Problem

Hypotheses about the relationship between a dependent variable (regressand, response, outcome) and a set of independent variables (regressors, predictors, features).

Model: \(\boldsymbol{y}=\boldsymbol{X\beta}+\boldsymbol{\epsilon}\)

Most commonly, \(\boldsymbol{\beta}\) is estimated with ordinary least squares (sum of squared residuals is minimized).

Typical questions are:

  • How are variables A, B, C, … linked to variable Y?
  • Can A, B, C, D … explain/predict Y?

Assumptions of Ordinary Least Squares Regression

The acronym LINE can be used to recall the assumptions required for making inferences and predictions with models based on OLS. If we consider a simple linear regression with just a single predictor X, then:

  • L: There is a linear relationship between Y and X,
  • I: The errors are independent—there’s no connection between how far any two points lie from the regression line,
  • N: Y is normally distributed at each level of X, and
  • E: The variance of Y is equal for all levels of X.
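As a quick illustration (a simulated toy example, not the course data), data satisfying all four LINE assumptions can be generated directly; the error term is drawn independently from one normal distribution with constant spread:

```r
# Simulated data satisfying LINE: linear mean, independent draws,
# normal errors, equal variance at every level of X
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, mean = 0, sd = 1)

fit <- lm(y ~ x)
coef(fit)  # close to the true values 2 and 0.5
```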

Assumptions of Ordinary Least Squares Regression

These assumptions are depicted in the figure:

  • L: The mean value for Y at each level of X falls on the regression line.
  • I: We’ll need to check the design of the study to determine if the errors (vertical distances from the line) are independent of one another.
  • N: At each level of X, the values for Y are normally distributed.
  • E: The spread in the Y’s for each level of X is the same.

How to deal with those assumptions

General

  • don’t test them
    • dependent on n
    • those tests have Type 1 & 2 errors as well -> due to serial testing, the global error rates are off
  • visually inspect variables before analysis and residuals afterwards
  • use robust techniques
    • bootstrapping - should be able to deal with moderate heteroscedasticity, outliers, influential cases
    • trimming, winsorizing
    • robust least squares, e.g. Iteratively Reweighted Least Squares (IRLS). Basic idea of IRLS: cases with bigger residuals get less weight. How to: just replace lm() with MASS::rlm()
  • robust techniques can be used alone or combined; Tip: always use bootstrapping and add the rest when needed
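A minimal sketch of the IRLS idea with MASS::rlm() on simulated data (a toy example, not the course dataset): a few gross outliers drag the OLS estimates, while rlm() down-weights them:

```r
library(MASS)  # rlm(): robust regression fitted by IRLS

set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
y[1:5] <- y[1:5] + 20        # contaminate 5% of cases with gross outliers

fit_ols <- lm(y ~ x)         # intercept pulled toward the outliers
fit_rob <- rlm(y ~ x)        # cases with big residuals get less weight

coef(fit_ols)
coef(fit_rob)
head(fit_rob$w, 5)           # final IRLS weights of the contaminated cases
```

The weights stored in `fit_rob$w` make the "bigger residual, less weight" idea visible: the contaminated cases end up with weights far below 1.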

How to deal with those assumptions

Independence of errors

  • there is no test/plot to evaluate the independence assumption
  • evidence for lack of independence comes from knowing about the study design and methods of data collection
  • use appropriate analyses (Multilevel regression)
  • sometimes people recommend ignoring the nested structure when the intraclass correlation (ICC) is low – better to ignore that advice

How to deal with those assumptions

Linearity and Normal distribution

  • Generalized linear models
  • non-linear transformations of y
  • add non-linear transformations of xs
    • x²
    • log(x)
    • x1*x2
    • …
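For example (simulated data, assumed for illustration), curvature can be captured by adding transformed predictors inside the same lm() call; I() keeps arithmetic expressions like x^2 from being parsed as formula operators:

```r
set.seed(7)
x <- runif(200, 1, 10)
y <- 3 + 2 * log(x) + 0.3 * x^2 + rnorm(200)

fit_lin   <- lm(y ~ x)                # misses the curvature
fit_trans <- lm(y ~ log(x) + I(x^2))  # matches the generating model

c(linear = summary(fit_lin)$r.squared,
  transformed = summary(fit_trans)$r.squared)
```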

How to deal with those assumptions

Homoscedasticity

  • often a modification/transformation of y is helpful
  • especially the use of rates, e.g.
    • accidents/population instead of total number of accidents
    • BMI instead of weight
    • count/time unit instead of count
    • …

Let’s continue with an example

Research question

Can life satisfaction be inferred from

  • marital status
  • materialism
  • neuroticism

?

Data

Source: Górnik-Durose, Malgorzata E. Materialism & personality. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2020-04-15. https://doi.org/10.3886/E101900V1

Publication: Górnik-Durose, M. E. (2020). Materialism and well-being revisited: The impact of personality. Journal of Happiness Studies: An Interdisciplinary Forum on Subjective Well-Being, 21(1), 305–326. https://doi.org/10.1007/s10902-019-00089-8

[Note: we will use data from Study 3, but carry out other analyses than described in the original paper]

Variables we will use:

  • marital_status [single/civil partnership/married/other]
  • materialism [Polish version of the 9-item Material Values Scale]
  • neuroticism [subscale of Polish version of NEO Five-Factor Inventory]
  • life_satisfaction [Polish version of Satisfaction with Life Scale]

Initial exploratory analyses

skim through the raw data

Initial exploratory analyses

univariate

df_mat %>% DataExplorer::plot_bar()

df_mat %>% DataExplorer::plot_density()

Initial exploratory analyses

bivariate

Digression: Multicollinearity

  • the interpretation of a regression coefficient is that it represents the mean change in the DV for each 1 unit change in the IV if all others are held constant
  • the last part does not work if IVs are correlated
  • coefficient estimation becomes unstable; ultimately, it increases the SEs (affects ps)
  • can be quantified with Variance Inflation Factors (VIF > 5 is critical)
  • in R you can calculate it with car::vif(model)
  • solutions:
    • remove part of the highly correlated IVs
    • combine them (sums, PCA, …)
    • use regularization
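A VIF can also be computed by hand as \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on all other predictors; this is what car::vif() automates. A sketch on simulated predictors (variable names are illustrative):

```r
set.seed(3)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)  # nearly collinear with x1
x3 <- rnorm(n)                                 # independent of the others

# VIF_j = 1 / (1 - R^2) from regressing predictor j on the remaining ones
vif_manual <- function(j, X) {
  1 / (1 - summary(lm(X[, j] ~ X[, -j]))$r.squared)
}
X <- cbind(x1, x2, x3)
sapply(seq_len(ncol(X)), vif_manual, X = X)  # x1, x2 far above 5; x3 near 1
```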

Let’s start simple

simple linear regression with a continuous predictor

Starting point: life satisfaction as a function of materialism

Model 1: \(Y_{i}=\beta_0 + \beta_1 \text{materialism}_i + \epsilon_i\)

model1 <- lm(life_satisfaction ~ materialism, data = df_mat)
summary(model1)
## 
## Call:
## lm(formula = life_satisfaction ~ materialism, data = df_mat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5905  -3.5044   0.1866   3.8489  14.6390 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.94116    1.21265  23.042  < 2e-16 ***
## materialism -0.22294    0.04352  -5.123 4.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.442 on 358 degrees of freedom
## Multiple R-squared:  0.06829,    Adjusted R-squared:  0.06569 
## F-statistic: 26.24 on 1 and 358 DF,  p-value: 4.937e-07

Centering

= subtracting the mean value from all values -> new mean = 0

\(c_{i}=x_i - \overline{x}\)
  • applicable to \(\boldsymbol{y}\) and \(\boldsymbol{X}\)
  • it is a linear transformation and only changes the intercept (intercept = estimated value of the DV if all IVs were 0)
  • can facilitate interpretation of the intercept if 0 is not a meaningful reference (e.g. height of a person), but the mean value is
  • reduces collinearity when non-linear transforms (e.g., squares, interactions, …) are included

Standardizing

= centering + scaling with SD

\(z_i=\frac{x_i - \overline{x}}{s_x}\)
  • can facilitate interpretation of the weights (unit of change is SD instead of the original scale)
  • types:
    • yx standardization both y and x are standardized: if x changes by \(1{s_x}\) , then y changes by \(\beta_x s_y\)
    • y standardization only y is standardized: if x changes by 1 unit, then y changes by \(\beta_x s_y\)
    • x standardization only x is standardized: if x changes by \(1{s_x}\) , then y changes by \(\beta_x\) units
  • changes the weights; SEs scale likewise -> no change in \(p\text{s}\) (if there is no interaction term)
  • don’t standardize binary/categorical predictors
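Both points can be verified directly (on simulated data, assumed for illustration): centering leaves the slope untouched and moves the intercept to \(\overline{y}\), and standardizing rescales the weight but not the t-statistic:

```r
set.seed(5)
x <- rnorm(100, mean = 50, sd = 10)
y <- 10 + 0.4 * x + rnorm(100)

fit_raw <- lm(y ~ x)
fit_ctr <- lm(y ~ I(x - mean(x)))  # centered x: same slope, intercept = mean(y)
fit_std <- lm(y ~ scale(x))        # x standardized: slope is change per 1 SD of x

coef(fit_ctr)[1]                             # equals mean(y)
coef(fit_std)[2] / sd(x)                     # recovers the raw-scale slope
summary(fit_raw)$coefficients[2, "t value"]  # identical across all three fits
```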

Centering/Scaling for Model 1?

df_z <- df_mat %>% mutate_if(is.numeric, scale)
model2 <- lm(life_satisfaction ~ materialism, data = df_z)
summary(model2)
## 
## Call:
## lm(formula = life_satisfaction ~ materialism, data = df_z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.76923 -0.62247  0.03314  0.68365  2.60022 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.351e-16  5.094e-02   0.000        1    
## materialism -2.613e-01  5.102e-02  -5.123 4.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9666 on 358 degrees of freedom
## Multiple R-squared:  0.06829,    Adjusted R-squared:  0.06569 
## F-statistic: 26.24 on 1 and 358 DF,  p-value: 4.937e-07

Continue simple

simple linear regression with a categorical predictor

model3 <- lm(life_satisfaction ~ marital_status, data = df_z)
summary(model3)
## 
## Call:
## lm(formula = life_satisfaction ~ marital_status, data = df_z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58066 -0.57286  0.08132  0.61597  2.56804 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                      -0.2422     0.1085  -2.232   0.0262 *
## marital_statuscivil partnership   0.3529     0.1455   2.425   0.0158 *
## marital_statusmarried             0.2989     0.1351   2.213   0.0275 *
## marital_statusother               0.2256     0.2699   0.836   0.4037  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9885 on 350 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.01903,    Adjusted R-squared:  0.01062 
## F-statistic: 2.263 on 3 and 350 DF,  p-value: 0.0809

Categorical predictors

1 variable and 1 \(\beta\) are not enough (#categories - 1 coded variables are needed)

  • dummy coding - compares each level to the reference level, intercept being the cell mean of the reference group (R did that just before)
    x         x1  x2  x3
    single     0   0   0
    civil p.   1   0   0
    married    0   1   0
    other      0   0   1
  • simple coding - compares each level to the reference level, intercept being the grand mean (instead of 0/1, (-1/k)/([k-1]/k))
  • deviation coding - compares each level to the grand mean
  • difference coding - compares adjacent levels (forward: with next, backward: with previous)
  • see e.g. R package contrast for more (orthogonal polynomial, Helmert, …)
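A sketch of the first and third scheme in R (simulated group data; level names mirror the example above): the default treatment contrasts produce the dummy coding, while "contr.sum" yields deviation coding, whose intercept is the unweighted grand mean:

```r
set.seed(9)
g  <- factor(rep(c("single", "civil p.", "married", "other"), each = 25),
             levels = c("single", "civil p.", "married", "other"))
mu <- c("single" = 0, "civil p." = 0.4, "married" = 0.3, "other" = 0.2)
y  <- mu[as.character(g)] + rnorm(100, sd = 0.1)

fit_dummy <- lm(y ~ g)                                     # treatment contrasts
fit_dev   <- lm(y ~ g, contrasts = list(g = "contr.sum"))  # deviation coding

coef(fit_dummy)[1]  # cell mean of the reference level ("single")
coef(fit_dev)[1]    # unweighted grand mean of the four cell means
```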

Multiple Regression OLS - not robust

model4 <- lm(life_satisfaction ~ neuroticism + materialism + marital_status, data = df_z)
summary(model4)
## 
## Call:
## lm(formula = life_satisfaction ~ neuroticism + materialism + 
##     marital_status, data = df_z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.57970 -0.56966  0.04781  0.54530  1.92929 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -0.14650    0.09058  -1.617  0.10672    
## neuroticism                     -0.50283    0.04718 -10.658  < 2e-16 ***
## materialism                     -0.14070    0.04882  -2.882  0.00420 ** 
## marital_statuscivil partnership  0.38642    0.12197   3.168  0.00167 ** 
## marital_statusmarried            0.04233    0.11487   0.368  0.71273    
## marital_statusother              0.19767    0.22453   0.880  0.37927    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8223 on 348 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.325,  Adjusted R-squared:  0.3153 
## F-statistic: 33.52 on 5 and 348 DF,  p-value: < 2.2e-16

Residual Analysis

par(mfrow=c(2,2))
plot(model4)

Residuals vs. Fitted

  • can be used to check the Linearity assumption
  • residuals should be patternless around \(y = 0\)
  • if not, there is an unaccounted pattern

Normal Q-Q

  • can be used to check the Normality assumption
  • if residuals deviate from the line, they do not conform to a theoretical normal curve

Scale-Location

  • can be used to check Homoscedasticity assumption
  • positive/negative trends indicate that variability is not constant

Residuals vs Leverage

  • can be used to check for influential cases and outliers
  • points with high leverage (unusual X-values) and/or high absolute residuals can have an undue influence on model parameters
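Cook's distance combines both ingredients (leverage and residual size) into a single number per case; a sketch on simulated data with one planted influential point:

```r
set.seed(11)
x <- c(rnorm(50), 10)               # case 51 has an unusual X value (high leverage)
y <- c(2 * x[1:50] + rnorm(50), 0)  # ...and also lies far from the trend
fit <- lm(y ~ x)

cooks <- cooks.distance(fit)
which.max(cooks)                    # the planted case dominates
sum(cooks > 4 / length(y))          # common rule-of-thumb cutoff 4/n
```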

Bootstrapped Regression

set.seed(42) # for reproducibility
bt_regs <- df_z %>% 
  rsample::bootstraps(1000) %>% # better 10k - I just don't want to wait
  pull(splits) %>% 
  map_dfr(~lm(life_satisfaction ~ neuroticism + materialism + marital_status, data = .) %>% 
            broom::tidy())
bt_summary <- bt_regs %>% 
  group_by(term) %>% 
  dplyr::summarize(
    est_mean = mean(estimate),
    est_median = median(estimate),
    low = quantile(estimate, .025),
    high = quantile(estimate, .975)
  )
bt_summary %>% toTable()
term                              est_mean    est_median  low         high
(Intercept)                       -0.1474578  -0.1501277  -0.3139247   0.0242419
marital_statuscivil partnership    0.3901911   0.3958104   0.1372584   0.6227935
marital_statusmarried              0.0406927   0.0413211  -0.1792816   0.2465103
marital_statusother                0.1954798   0.1894392  -0.1419469   0.5465822
materialism                       -0.1389287  -0.1397394  -0.2312371  -0.0339778
neuroticism                       -0.5043113  -0.5034426  -0.6070621  -0.4057855

Bootstrapped Regression

model5 <- model4  # copy the fitted object, then swap in the bootstrap means
coeffs <- setNames(as.list(bt_summary$est_mean), bt_summary$term)
for (i in seq_along(coef(model5))) {
  model5$coefficients[i] <- coeffs[[names(coef(model5))[i]]]
}
coef                              model4   model5
(Intercept)                       -0.146   -0.147
neuroticism                       -0.503   -0.504
materialism                       -0.141   -0.139
marital_statuscivil partnership    0.386    0.390
marital_statusmarried              0.042    0.041
marital_statusother                0.198    0.195

Bootstrapped Regression

  • a 95% confidence interval for each parameter can be found by taking the middle 95% of each bootstrap distribution [aka percentile method]. If the interval includes 0, then the parameter is non-significant for chosen \(\alpha\)-level.
  • to get omnibus information, use the mean/median model to get predicted values and calculate what you need
m5_predictions <- model5 %>% predict(df_z)
caret::postResample(m5_predictions, df_z$life_satisfaction)
##      RMSE  Rsquared       MAE 
## 0.8152959 0.3250205 0.6527729
  • even better: Bootstrap the calculation to get confidence intervals for everything
  • if there are no assumption violations, bootstrapping yields ~ same results as standard OLS
  • if there are assumption violations, the results are more robust

Regression with Interaction aka Moderation

  • when an interaction is added, centering does change p-values (see Afshartous & Preston, 2011)
  • instead there is a range of significance (the subset of the continuum of conditional effects that is statistically significant)
  • therefore, DON’T call it main and interaction effect
  • better: linear and conditional effect
  • if you include an interaction, report where the conditional effect is significant, e.g., with interaction plots and Johnson-Neyman intervals:

Regression with Interaction

Model

model6 <- lm(life_satisfaction ~ neuroticism * materialism + marital_status, data = df_z)
summary(model6)
## 
## Call:
## lm(formula = life_satisfaction ~ neuroticism * materialism + 
##     marital_status, data = df_z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.57602 -0.56957  0.04574  0.54717  1.93154 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -0.147688   0.091167  -1.620  0.10615    
## neuroticism                     -0.503254   0.047357 -10.627  < 2e-16 ***
## materialism                     -0.140181   0.049047  -2.858  0.00452 ** 
## marital_statuscivil partnership  0.386010   0.122187   3.159  0.00172 ** 
## marital_statusmarried            0.041161   0.115379   0.357  0.72150    
## marital_statusother              0.196278   0.225101   0.872  0.38384    
## neuroticism:materialism          0.005336   0.040898   0.130  0.89627    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8235 on 347 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.3251, Adjusted R-squared:  0.3134 
## F-statistic: 27.85 on 6 and 347 DF,  p-value: < 2.2e-16

Interaction Plot

interactions::interact_plot(
  model6, pred = materialism, modx = neuroticism, plot.points = TRUE)

Johnson-Neyman intervals

interactions::johnson_neyman(model6, pred = materialism, modx = neuroticism, control.fdr = TRUE)
## JOHNSON-NEYMAN INTERVAL 
## 
## When neuroticism is INSIDE the interval [-1.02, 0.66], the slope of
## materialism is p < .05.
## 
## Note: The range of observed values of neuroticism is [-2.29, 2.17]
## 
## Interval calculated using false discovery rate adjusted t = 2.36

Homework (graded)

due by 2022-11-10

  • build a robust model (without interaction) using the function MASS::rlm()
  • describe the differences you see compared to model4

Thank you for your attention!

Next Time (2022-11-17):

Nested Data